Add Vector Search semantic product discovery example by janniklasrose · Pull Request #153 · databricks/bundle-examples

janniklasrose · 2026-05-01T09:14:18Z

Summary

Adds a Declarative Automation Bundle under contrib/vector_search_product_discovery/ that demonstrates semantic product search end-to-end with Databricks Vector Search:

- vector_search_endpoints + vector_search_indexes declared as bundle resources, with jobs referencing them via ${resources.*.name} so dev-mode prefixing flows through automatically
- Direct Access index (engine: direct in databricks.yml); descriptions are embedded explicitly in 01_upsert_products.py and the query notebook embeds the query before calling similarity_search. Direct Access indexes don't auto-embed (that's a Delta Sync feature)
- schema_json uses the flat {"col":"type"} form required by the API
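
For readers unfamiliar with the flat form, a minimal sketch of what such a schema_json string can look like follows. The column names other than description_vector are hypothetical, and array<float> as the vector column type is an assumption, not taken from this PR's index.yml. The helper to_schema_json is illustrative only.

```python
import json

# Hypothetical columns for illustration; only description_vector is named in this PR.
columns = {
    "product_id": "string",
    "name": "string",
    "description": "string",
    "description_vector": "array<float>",  # assumed type for the embedding column
}


def to_schema_json(cols):
    """Serialize a flat {"col": "type"} mapping, rejecting nested values."""
    for name, col_type in cols.items():
        if not isinstance(col_type, str):
            raise TypeError(f"column {name!r} must map to a type string, got {type(col_type).__name__}")
    return json.dumps(cols)


schema_json = to_schema_json(columns)
print(schema_json)
```

The check exists only to make the "flat" requirement explicit: every value is a plain type string rather than a nested object.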

Dependency

Requires databricks/cli#5123 (still open), which lands vector_search_indexes as a first-class DABs resource on the direct engine. Until that PR merges and ships in a CLI release, databricks bundle deploy against this example will fail to recognize the vector_search_indexes resource type.

Test plan

databricks bundle validate against a CLI built from cli#5123 ("Add vector_search_indexes resource (direct engine)")

databricks bundle plan

create jobs.product_discovery_query
create jobs.product_discovery_query.permissions
create jobs.product_discovery_setup
create jobs.product_discovery_setup.permissions
create schemas.product_search_schema
create vector_search_endpoints.product_search_endpoint
create vector_search_endpoints.product_search_endpoint.permissions
create vector_search_indexes.product_index

Plan: 8 to add, 0 to change, 0 to delete, 0 unchanged

databricks bundle deploy — endpoint reaches ONLINE, index created

Uploading bundle files to /Workspace/Users/[...]/.bundle/vector_search_product_discovery/prod/files...
Deploying resources...
Updating deployment state...
Deployment complete!

databricks bundle run product_discovery_setup — products embedded and upserted

2026-06-02 14:23:32 "product_discovery_setup" RUNNING
2026-06-02 14:24:23 "product_discovery_setup" TERMINATED SUCCESS

databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails" — returns ranked results
```
2026-06-02 14:26:39 "product_discovery_query" RUNNING
2026-06-02 14:27:21 "product_discovery_query" TERMINATED SUCCESS
```
databricks bundle destroy — clean teardown

This pull request and its description were written by Isaac.

Demonstrates a Direct Access Vector Search index and endpoint declared as bundle resources (vector_search_endpoints, vector_search_indexes), tested e2e against staging with the direct engine.

Key design decisions:

- Jobs use resource references (${resources.*.name}) for endpoint and index names so dev-mode prefixing flows through automatically
- schema_json uses flat {"col":"type"} format required by the API
- Notebooks embed descriptions/queries explicitly (Direct Access indexes don't auto-embed; that's a Delta Sync feature)
- engine: direct set in bundle config so no env var is needed

Co-authored-by: Isaac

pietern

Ran this example end-to-end on a dogfood workspace with the released CLI (v1.1.0): validate → deploy → run (setup + query) → destroy. The embed → upsert → similarity-search logic is correct — all three README example queries returned the documented top result, so the substance is solid. Also confirmed v1.1.0 recognizes vector_search_endpoints / vector_search_indexes, so the cli#5123 dependency has shipped (correctly struck through in the description).

Nice to see the index name reference ${resources.schemas.product_search_schema.name} rather than the raw ${var.schema} — that's the mode-prefix-safe form.

Remaining feedback is about per-deploy isolation and the CLI run experience, flagged inline. Nothing blocks the single-user happy path; it's mostly "what happens when a second person deploys this into the same workspace."

pietern · 2026-06-02T13:26:15Z

+resources:
+  vector_search_endpoints:
+    product_search_endpoint:
+      name: ${var.endpoint_name}


The endpoint name is a hardcoded, workspace-global value with no per-deploy uniqueness. Unlike jobs/schemas, vector_search_endpoints names aren't rewritten by any deployment mode, so every deploy of this example tries to create product-search-endpoint. A second user — or a second copy — gets 409 ALREADY_EXISTS; I hit exactly this against an existing endpoint while testing. The README already notes it "must be unique per workspace," but nothing in the bundle makes it so.

Options: treat the endpoint as a shared prerequisite/variable (endpoints are slow to provision and are designed to host many indexes), or bake something unique into the default (e.g. ${workspace.current_user.short_name}). This compounds with the single prod target (see the comment on databricks.yml).

pietern · 2026-06-02T13:26:16Z

+  prod:
+    mode: production
+    default: true


The bundle ships only a prod / mode: production target, set as the default. For a copy-me example that's worth reconsidering: mode: production applies no name prefixing, so a plain databricks bundle deploy creates unprefixed, shared-namespace resources — the product-search-endpoint endpoint, the main.product_search schema, and main.product_search.product_index. Two people trying the quickstart collide on all three, and it writes into a generic schema in main.

Most examples default to a dev target with mode: development, so the first thing a user runs produces an isolated, prefixed copy. Consider adding a dev default target (and/or namespacing the schema and endpoint), keeping prod for the production story.

pietern · 2026-06-02T13:26:16Z

+          }
+        embedding_vector_columns:
+          - name: description_vector
+            embedding_dimension: 1024


embedding_dimension has two sources of truth. databricks.yml defines an embedding_dimension variable (default "1024") and threads it through both notebooks, but this line hardcodes 1024 independently — and the value is immutable after index creation. An override like --var embedding_dimension=512 would silently produce vectors this index rejects. Consider referencing the variable so there's one knob:

Suggested change

embedding_dimension: 1024

embedding_dimension: ${var.embedding_dimension}

The variable defaults to the string "1024" (fine as a job param) while this field is an integer, so confirm the interpolation coerces — or declare the variable's type as integer.
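
Independently of how the bundle field is wired, a notebook-side guard could surface a dimension mismatch before the upsert rather than at the index. The following is a minimal sketch under stated assumptions: check_dimensions is a hypothetical helper that is not part of this PR, and the vectors are placeholder data rather than real embeddings.

```python
def check_dimensions(vectors, expected_dim):
    """Fail fast if any embedding does not match the index dimension.

    expected_dim may arrive as a string (job parameters are strings, e.g. "1024"),
    so it is coerced to int before comparing.
    """
    expected = int(expected_dim)
    bad_rows = [i for i, vec in enumerate(vectors) if len(vec) != expected]
    if bad_rows:
        raise ValueError(
            f"{len(bad_rows)} vector(s) do not have dimension {expected}; "
            f"first offending row: {bad_rows[0]}"
        )
    return expected


# Placeholder vectors standing in for real embeddings.
vectors = [[0.0] * 1024, [0.1] * 1024]
dim = check_dimensions(vectors, "1024")
print(f"all {len(vectors)} vectors have dimension {dim}")
```

This does not remove the second source of truth in the index definition; it only turns a silent mismatch into an explicit error in the setup job.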

pietern · 2026-06-02T13:26:16Z

+rows = results["result"]["data_array"]
+df = pd.DataFrame(rows, columns=result_columns)
+df.index += 1
+print(df.to_string())


Query results aren't visible from databricks bundle run — the path the README leads with. print(df.to_string()) only reaches the notebook cell output; bundle run product_discovery_query shows just RUNNING / TERMINATED SUCCESS, and jobs get-run-output comes back empty because the notebook never calls dbutils.notebook.exit(). For a demo whose payoff is the ranked list, consider also exiting with the result:

dbutils.notebook.exit(df.to_json(orient="records"))

(Keep the print for interactive use.) Otherwise the README should note that you open the run URL to see the output.
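
To illustrate the shape of the exit payload, here is a pandas-free sketch that builds the same list-of-records JSON from a response shaped like results["result"]["data_array"]. The sample response, the column names, and the helper results_to_exit_payload are all hypothetical; in the notebook the resulting string would be handed to dbutils.notebook.exit, which is not available outside Databricks and is therefore only shown as a comment.

```python
import json


def results_to_exit_payload(results, result_columns):
    """Turn a similarity-search style response into a JSON list of records."""
    rows = results["result"]["data_array"]
    records = [dict(zip(result_columns, row)) for row in rows]
    return json.dumps(records)


# Hypothetical response and columns, for illustration only.
sample_results = {
    "result": {
        "data_array": [
            ["p-001", "Trail shoe", 0.91],
            ["p-002", "Rain boot", 0.84],
        ]
    }
}
result_columns = ["product_id", "name", "score"]

payload = results_to_exit_payload(sample_results, result_columns)
print(payload)
# In a Databricks notebook, the payload would then be returned with:
# dbutils.notebook.exit(payload)
```

Each record keeps the column order of result_columns, so the ranked order of the rows is preserved in the JSON array.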

juliacrawf-db · 2026-06-02T23:23:45Z

Don't use real brands - make these fake - otherwise it can look like we officially endorse them.

juliacrawf-db · 2026-06-02T23:28:07Z

@@ -0,0 +1,157 @@
+# Vector Search: Semantic Product Discovery
+
+A Declarative Automation Bundle demonstrating **semantic product search** using


Suggested change

A Declarative Automation Bundle demonstrating **semantic product search** using

A Declarative Automation Bundle demonstrating semantic product search using

Nit. There's no reason to bold this.

juliacrawf-db · 2026-06-02T23:28:28Z

+# Vector Search: Semantic Product Discovery
+
+A Declarative Automation Bundle demonstrating **semantic product search** using
+[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html).


I have heard "Vector Search" is being renamed?

juliacrawf-db · 2026-06-02T23:29:03Z

+Keyword search fails when shoppers use different words than what appears in product
+descriptions. A customer searching for *"something to keep my coffee hot all day"* won't
+match a product described as an *"insulated stainless water bottle with double-wall vacuum
+insulation"* — even though it's the right answer.


Suggested change

insulation"* — even though it's the right answer.

insulation" even though it's the right answer.

juliacrawf-db · 2026-06-02T23:29:12Z

+## The problem
+
+Keyword search fails when shoppers use different words than what appears in product
+descriptions. A customer searching for *"something to keep my coffee hot all day"* won't


Suggested change

descriptions. A customer searching for *"something to keep my coffee hot all day"* won't

descriptions. A customer searching for "something to keep my coffee hot all day" won't

juliacrawf-db · 2026-06-02T23:29:21Z

+
+Keyword search fails when shoppers use different words than what appears in product
+descriptions. A customer searching for *"something to keep my coffee hot all day"* won't
+match a product described as an *"insulated stainless water bottle with double-wall vacuum


Suggested change

match a product described as an *"insulated stainless water bottle with double-wall vacuum

match a product described as an "insulated stainless water bottle with double-wall vacuum

juliacrawf-db · 2026-06-02T23:29:42Z

+match a product described as an *"insulated stainless water bottle with double-wall vacuum
+insulation"* — even though it's the right answer.
+
+Semantic search using vector embeddings matches on **meaning**, not words.


Suggested change

Semantic search using vector embeddings matches on **meaning**, not words.

Semantic search using vector embeddings matches on meaning, not words.

juliacrawf-db · 2026-06-02T23:31:49Z

+## Prerequisites
+
+- Databricks workspace with Unity Catalog enabled
+- Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources


I would just put the version unless there is a reason not to?

Suggested change

- Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources

- Databricks CLI version 1.1.0 or above

juliacrawf-db · 2026-06-02T23:40:34Z

+- Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources
+- An existing Unity Catalog catalog (default: `main`)
+
+## Quick start


I'd probably call this Usage?

juliacrawf-db · 2026-06-02T23:41:55Z

+   databricks auth login --host https://your-workspace.cloud.databricks.com
+   ```
+
+2. **Configure** `databricks.yml` — set the workspace host and any variable overrides


Suggested change

2. **Configure** `databricks.yml` — set the workspace host and any variable overrides

2. Configure `databricks.yml`. Set the workspace host and any variable overrides.

I'm not a fan of the bolded style of these steps, but minor nit.

juliacrawf-db · 2026-06-02T23:42:12Z

+
+## Quick start
+
+1. **Authenticate**


Suggested change

1. **Authenticate**

1. Authenticate the CLI:

juliacrawf-db · 2026-06-02T23:42:34Z

+
+2. **Configure** `databricks.yml` — set the workspace host and any variable overrides
+
+3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json`


Suggested change

3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json`

3. Deploy the bundle. This creates the schema, endpoint, index, jobs, and syncs `data/products.json`.

juliacrawf-db · 2026-06-02T23:43:04Z

+   ```
+   > Vector Search endpoint creation takes a few minutes to reach ONLINE status.
+
+4. **Load the catalog** — embeds all product descriptions and upserts them into the index


Suggested change

4. **Load the catalog** — embeds all product descriptions and upserts them into the index

4. Load the catalog by running the bundle. This embeds all product descriptions and upserts them into the index.

juliacrawf-db · 2026-06-02T23:44:02Z

+   databricks bundle run product_discovery_setup
+   ```
+
+5. **Search** — pass any natural-language query


Suggested change

5. **Search** — pass any natural-language query

5. Pass any natural-language query to search.

juliacrawf-db · 2026-06-02T23:44:19Z

+   databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails"
+   ```
+
+6. **Or open** `src/02_query_demo.py` in your workspace to run queries interactively


Suggested change

6. **Or open** `src/02_query_demo.py` in your workspace to run queries interactively

6. Or open `src/02_query_demo.py` in your workspace to run queries interactively:

juliacrawf-db · 2026-06-02T23:45:03Z

+## Bundle resources
+
+| Resource | Type | Description |
+|---|---|---|
+| `product_search_schema` | `schemas` | Unity Catalog schema that namespaces the index |
+| `product_search_endpoint` | `vector_search_endpoints` | Managed ANN serving endpoint |
+| `product_index` | `vector_search_indexes` | Direct Access index — schema defined in `resources/index.yml` |
+| `product_discovery_setup` | `jobs` | Embeds product descriptions and upserts into the index |
+| `product_discovery_query` | `jobs` | Embeds a query and returns ranked results |


This section feels misplaced - this is part of the project structure below. And in fact, maybe delete this or merge the two?

Or actually maybe the opposite - put the project structure here because that is a nice overview and then merge the descriptions with that?

juliacrawf-db · 2026-06-02T23:47:59Z

+`direct_access_index_spec` with `index_type: DELTA_SYNC` and `delta_sync_index_spec` in
+`resources/index.yml`, and remove the upsert job.
+
+## Project structure


Move this up higher? It sure seems like you want this info before all the descriptions of the resources/files.

juliacrawf-db · 2026-06-02T23:50:57Z

+A Declarative Automation Bundle demonstrating **semantic product search** using
+[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html).
+
+## The problem


I would probably put this as the first paragraph of "How it works" instead of a separate "The problem" section.

juliacrawf-db · 2026-06-02T23:53:14Z

@@ -0,0 +1,157 @@
+# Vector Search: Semantic Product Discovery
+
+A Declarative Automation Bundle demonstrating **semantic product search** using


I think I get it after reading this whole page and usage, etc, but I think it might have helped if this first sentence expanded a bit on what using a bundle helps me do here (even if it's obvious that it is automating setup - say that).

juliacrawf-db · 2026-06-02T23:58:14Z

Just calling it out: users will take the file names here and copy them directly when they use this example to create their own templates with this resource. So is "endpoint.yml" a good (best practice) file name to contain vector search endpoint definitions? (And same for the others.)

janniklasrose added 2 commits April 30, 2026 23:59

Apply ruff format to upsert/query notebooks

2e83d4c

Co-authored-by: Isaac

janniklasrose requested review from andrewnester, denik and pietern May 27, 2026 15:57

janniklasrose added 2 commits June 1, 2026 15:26

Use schema resource reference

8801820

Cleanup

3a06d9f

pietern reviewed Jun 2, 2026


Comment thread contrib/vector_search_product_discovery/resources/index.yml Outdated

Comment thread knowledge_base/vector_search_product_discovery/resources/setup_job.yml

janniklasrose added 3 commits June 2, 2026 14:55

Move files

fa94643

Format schema_json

24daec9

Strip .job.yml from .yml files

a9a8957

pietern reviewed Jun 2, 2026


juliacrawf-db reviewed Jun 2, 2026




3 participants
